Checklist

The submission includes the following.

Group project

For the project, we use the following packages:

rm(list = ls())

library(cmdstanr)
Warning: package 'cmdstanr' was built under R version 4.4.3
library(dplyr)
library(splines)
library(janitor)
library(readr)
library(brms)
library(ggplot2)
library(tidyr)
library(priorsense)
options(mc.cores = parallel::detectCores()) # parallelize if possible
options(brms.file_refit = "on_change") # save the files if the model has changed
ggplot2::theme_set(ggplot2::theme_light()) # nicer theme

1. Dataset Selection (0.5pt)

Select a dataset with clusters such as schools, regions, or people with multiple observations per individual (for example, from https://www.kaggle.com/). It would be a good idea to choose a smallish dataset (not too many rows, e.g., fewer than 1000) or subset it so that fitting the models doesn’t take too long.

  1. Describe the dataset with a couple of short sentences. What was its intended use? Are there papers that reference it? Provide information on how to obtain the dataset, including its source and any necessary preprocessing steps/feature engineering.

The original dataset is the Real Estate Sales 2001–2022 dataset, a record of residential property transactions over twenty-two years (Masidekabakci, 2024). It includes raw sale prices (Sale.Amount), tax-assessed values (Assessed.Value), transaction dates, and various property attributes (e.g., location, size, year built). The data is grouped by time periods (e.g., months or years) and location coordinates. It was created for educational purposes, for real estate analysis and for exploring patterns and models in property sales over time, but it is mostly useful for regression tasks predicting sale amounts. It is not peer-reviewed, and no major papers have used this exact compilation, although similar datasets appear in the literature and in Kaggle competitions. It is pulled from Kaggle: https://www.kaggle.com/code/masidekabakci/real-estate-sales-2001-2022.

Our preprocessing steps include parsing the date field, handling or dropping missing values, converting categorical columns to factors, and (for modeling) log-transforming highly skewed columns like ‘sale_amount’ and ‘assessed_value’. We also split the ‘Location’ column into two columns, ‘lon’ and ‘lat’, for the spatial GP model. Furthermore, we plotted a histogram of the sale amount. Lastly, we removed outliers falling outside 1.5 times the interquartile range.

# load the data and preprocessing steps go here
getwd() # e.g. "/Users/niyaneykova/Desktop/Bayesian Modeling"
[1] "/Users/niyaneykova/Desktop/Bayesian Modeling"
# read the data relative to the working directory
real_estate_dataset <- read.csv("data/real_estate_dataset.csv")

head(real_estate_dataset)
  List.Year Town Assessed.Value Sale.Amount Property.Type Residential.Type Non.Use.Code
1      2022   48         207700      400000             3                3           11
2      2022   38          49700       29900             3                3            4
3      2022   59         208230      423261             3                3           11
4      2022   53         169050      351000             3                3            4
  Date_Recorded date_ordinal DayOfWeek WeekOfYear Month MonthName Quarter Season
1    2023-09-20        19620         3         38     9         9       3      4
2    2023-02-27        19415         1          9     2         2       1      1
3    2022-10-31        19296         1         44    10        10       4      4
4    2023-03-30        19446         4         13     3         3       1      2
  MonthInSeason DOW_num  Month_sin     Month_cos   Week_sin      Week_cos    DOW_sin
1             1       3 -1.0000000 -1.836970e-16 -0.9927089 -1.205367e-01  0.4338837
2             3       1  0.8660254  5.000000e-01  0.8854560  4.647232e-01  0.7818315
3             2       1 -0.8660254  5.000000e-01 -0.8229839  5.680647e-01  0.7818315
4             1       4  1.0000000  6.123234e-17  1.0000000 -1.608123e-16 -0.4338837
     DOW_cos       lon      lat spatial_cluster
1 -0.9009689 -72.47379 41.78533               3
2  0.6234898 -71.87593 41.57067              16
3  0.6234898 -73.40230 41.51023              27
4 -0.9009689 -72.10181 41.43314               4
 [ reached 'max' / getOption("max.print") -- omitted 2 rows ]
Realestate <- real_estate_dataset
Realestate <- Realestate %>%
  mutate(
    log_sale      = log(`Sale.Amount`),
    log_assessed1 = log(`Assessed.Value` + 1)
  )
# Select the most useful variables
Realestate_clean <- Realestate %>%
  select(`List.Year`, Town, `Sale.Amount`, `Property.Type`, `Residential.Type`, `Assessed.Value`, Season, Non.Use.Code, lon, lat, log_sale, log_assessed1)

# Drop all rows containing missing values and values of zero
Realestate_clean <- Realestate_clean %>%
  drop_na() %>%
  filter(if_all(everything(), ~ . != 0))

# Inspect the levels of some variables
unique(Realestate_clean$Town)
  [1]  48  38  59  53   9  10  39  32  15  45  26  34  52  22   7  16  41   5  36  25  27
 [22]   1  46   3  12  20   4   2  23  30  11  37  64  84  73  85  57  51  66  71  94 106
 [43] 102  92 108 103  13  43  18  47  55  19  14  54  28  35  33  76  89  77  86  75  90
 [64]  91  87  98 101  56  49   8  60  40  82  65  99 105  63  72  81  97  70 100  95  67
 [85]  88  96  68 104  74  29  31  42  44 107  50  58  78  93  80  24  83  61  21  17   6
[106]  69  79
unique(Realestate_clean$`Property.Type`)
[1] 3 4 1 2 6 5
unique(Realestate_clean$`Residential.Type`)
[1] 3 4 1 5 2
str(Realestate_clean)
'data.frame':   557 obs. of  12 variables:
 $ List.Year       : int  2022 2022 2022 2022 2022 2022 2022 2022 2022 2022 ...
 $ Town            : int  48 38 59 53 9 38 53 10 39 59 ...
 $ Sale.Amount     : num  400000 29900 423261 351000 3500000 ...
 $ Property.Type   : int  3 3 3 3 3 3 3 3 3 3 ...
 $ Residential.Type: int  3 3 3 3 3 3 3 4 3 1 ...
 $ Assessed.Value  : num  207700 49700 208230 169050 769900 ...
 $ Season          : int  4 1 4 2 2 3 1 1 4 2 ...
 $ Non.Use.Code    : int  11 4 11 4 11 11 11 11 4 11 ...
 $ lon             : num  -72.5 -71.9 -73.4 -72.1 -72.8 ...
 $ lat             : num  41.8 41.6 41.5 41.4 41.3 ...
 $ log_sale        : num  12.9 10.3 13 12.8 15.1 ...
 $ log_assessed1   : num  12.2 10.8 12.2 12 13.6 ...
# Rename columns for use in brms
Realestate_clean <- clean_names(Realestate_clean)

# Randomly select 550 rows as a sample (set a seed for reproducibility)
set.seed(123)
Realestate_sample <- Realestate_clean %>%
  sample_n(550)

# Convert the categorical variables to factors
Realestate_sample <- Realestate_sample %>%
  mutate(
    Town = as.factor(town),
    `Property.Type` = as.factor(`property_type`),
    `Residential.Type` = as.factor(`residential_type`)
  )

# Log-transform numeric predictor variables
Realestate_sample <- Realestate_sample %>%
  mutate(
    `assessed_value` = log(`assessed_value` + 1),
    `list_year` = log(`list_year` + 1)
  )

# Remove outliers from the target variable
# Calculate Q1 and Q3
Q1 <- quantile(Realestate_sample$sale_amount, 0.25, na.rm = TRUE)
Q3 <- quantile(Realestate_sample$sale_amount, 0.75, na.rm = TRUE)

# Calculate IQR
IQR_value <- Q3 - Q1

# Define bounds
lower_bound <- Q1 - 1.5 * IQR_value
upper_bound <- Q3 + 1.5 * IQR_value

# Filter dataset to remove outliers
Realestate_sample <- Realestate_sample[Realestate_sample$sale_amount >= lower_bound &
  Realestate_sample$sale_amount <= upper_bound, ]


# Rescale the target variable by a factor of 1000 to give more manageable figures
Realestate_sample$sale_amount <- Realestate_sample$sale_amount / 1000

# Check the target variable
# Minimum value
min_value <- min(Realestate_sample$sale_amount, na.rm = TRUE)

# Maximum value
max_value <- max(Realestate_sample$sale_amount, na.rm = TRUE)

# Print results
cat("Minimum Sale Amount:", min_value, "\n")
Minimum Sale Amount: 4.5 
cat("Maximum Sale Amount:", max_value, "\n")
Maximum Sale Amount: 765 
# Histogram of the log-transformed sale amount
hist(Realestate_sample$log_sale,
  main = "Histogram of Log Sale Amount",
  xlab = "log(Sale Amount)",
  col = "lightblue",
  border = "black"
)

  1. Report the number of observations, columns (with their meaning) and their data types. Indicate clearly what you will use as dependent variable/label.

After pre-processing, our dataset has 518 observations and 15 variables remaining.

Our dependent variable is sale_amount (numeric), the price for which the property was sold. It has been divided by one thousand to make the numbers more manageable.

As independent variables we use the following columns: 1. list_year (numeric): the year in which the property was listed for sale, log-transformed. 2. town (categorical): the town in which the property is located, a factor with 107 levels. 3. log_assessed1 (numeric): the assessed value of the property before the sale, log-transformed. 4. season (categorical): the season in which the property was sold, coded as an integer from 1 to 4. 5. non_use_code (categorical): an integer code indicating properties that may not be used for typical purposes (e.g., vacant land). 6. property_type and 7. residential_type: categorical variables describing the property’s classification by type and usage. In addition, the numeric variables lon and lat capture the property’s geographic location.

2. Split the data and transform columns as necessary. (0.5pt)

Split the data into a training (80%) and a test set (20%). Transform the columns if necessary.

# Number of observations

n <- nrow(Realestate_sample)

# Fix the RNG seed so the split is reproducible
set.seed(123)
# Create a random sample of 80% row indices
train_indices <- sample(seq_len(n), size = floor(0.8 * n))

# Split the data
train_data <- Realestate_sample[train_indices, ]
test_data <- Realestate_sample[-train_indices, ]
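As a quick sanity check (not part of the original run), the split sizes can be verified with the objects defined above:

```r
# Hypothetical check: the two parts should add up to the full sample,
# with roughly an 80/20 division
stopifnot(nrow(train_data) + nrow(test_data) == n)
round(nrow(train_data) / n, 2)
round(nrow(test_data) / n, 2)
```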

3. Model Exploration (3pt)

  1. Fit multiple appropriate models to the dataset (as many models as there are members in the group, with a minimum of two models). Models might vary in the multilevel structure, informativeness of their priors (but not just trivial changes), model of the data/likelihood, etc. (I recommend not to use no-pooling models since they tend to take a long time and it’s very hard to assign good priors). Assess whether the models converged: report what the traceplots look like, the highest Rhat, the number of effective samples, etc. If a model didn’t converge, address the issues. (If you can’t solve the problems, report them and continue with the assignment).
# models go here
# Calculating Gelman et al. priors.
intercept_prior <- mean(Realestate_sample$log_sale)
SD <- sd(Realestate_sample$log_sale)
slope_prior <- 2.5 * sd(Realestate_sample$log_sale)
SD_prior <- 1 / sd(Realestate_sample$log_sale)

intercept_prior
[1] 12.06991
SD
[1] 0.9424508
slope_prior
[1] 2.356127
SD_prior
[1] 1.061063
def_priors_s <- c(
  prior(normal(12, 0.94), class = Intercept),
  prior(normal(0, 2.33), class = b),
  prior(exponential(1.07), class = sigma)
)
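Before fitting, a prior predictive check could verify that these priors generate plausible log sale prices. A sketch (not part of the original submission), assuming train_data and def_priors_s as defined above:

```r
# Draw from the priors only, ignoring the likelihood
prior_fit <- brm(
  log_sale ~ list_year + log_assessed1 + season + non_use_code + (1 | Town),
  data = train_data, family = gaussian(),
  prior = def_priors_s,
  sample_prior = "only",
  chains = 2, iter = 1000, seed = 123
)

# Compare draws from the prior predictive to the observed log_sale
pp_check(prior_fit, ndraws = 50)
```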

# Fitting a first model using pooling based on cities.

Sale_Price_1 <- brm(log_sale ~ (list_year + log_assessed1 + season + non_use_code + (1 | Town)),
  data = train_data,
  family = gaussian(),
  prior = def_priors_s,
  iter = 4000,
  chains = 4,
  seed = 123,
  file = "Sale_Price_1"
)

Sale_Price_1
 Family: gaussian 
  Links: mu = identity; sigma = identity 
Formula: log_sale ~ (list_year + log_assessed1 + season + non_use_code + (1 | Town)) 
   Data: train_data (Number of observations: 414) 
  Draws: 4 chains, each with iter = 4000; warmup = 2000; thin = 1;
         total post-warmup draws = 8000

Multilevel Hyperparameters:
~Town (Number of levels: 99) 
              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept)     0.20      0.05     0.11     0.30 1.00     2638     4745

Regression Coefficients:
              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept         2.35     18.02   -33.36    37.65 1.00    12711     5049
list_year         0.28      2.37    -4.39     4.97 1.00    12717     4984
log_assessed1     0.71      0.04     0.64     0.78 1.00     9819     6523
season            0.04      0.03    -0.01     0.10 1.00    12787     5814
non_use_code     -0.09      0.01    -0.11    -0.08 1.00    10871     6074

Further Distributional Parameters:
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma     0.58      0.02     0.54     0.62 1.00     7630     5874

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

Rhat = 1.00 across all parameters shows the chains converged, and Bulk_ESS and Tail_ESS are high, indicating reliable estimates. Bulk_ESS ranges from 2638 (sd(Intercept)) up to 12787 (season); Tail_ESS ranges from 4745 (sd(Intercept)) up to 6523 (log_assessed1).
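The worst-case diagnostics can also be pulled out programmatically rather than read off the summary; a sketch using brms’ diagnostic helpers on the fitted object:

```r
# Largest Rhat and smallest effective-sample-size ratio across all
# parameters of model 1
max(rhat(Sale_Price_1), na.rm = TRUE)
min(neff_ratio(Sale_Price_1), na.rm = TRUE)
```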

The intercept represents the expected log sale price, 2.35, when all predictors are zero (because we have not centered them). There is a modest difference in log sale prices across towns: town-level intercepts deviate from the global average by about 0.20 log units (exp(0.20) ≈ 1.22, i.e., roughly 22%).

The intercept and list_year have very large uncertainty. The model estimates a small effect for list_year, with a large standard error and a very wide 95% credible interval crossing 0, suggesting that list_year has no clear effect. The model assumes a linear effect of year, which may be inaccurate. Furthermore, setting all predictors to 0 is not meaningful in a real-world setting, so the intercept is not interpretable for real scenarios.

# Fitting a second model, now also allowing the slopes to vary with the assessed
# value, because it's reasonable to assume that there are differences in the
# assessed value per town. We have also added a horseshoe prior because it is
# plausible that not all variables add information.

def_priors_hs <- c(
  prior(normal(12, 0.94), class = Intercept),
  prior(horseshoe(par_ratio = 0.2), class = b),
  prior(exponential(1.07), class = sigma)
)

Sale_Price_2 <- brm(log_sale ~ (list_year + log_assessed1 + season + non_use_code + (1 + log_assessed1 | Town)),
  data = train_data,
  family = gaussian(),
  prior = def_priors_hs,
  iter = 4000,
  control = list(max_treedepth = 12, adapt_delta = 0.999),
  chains = 4,
  seed = 123,
  file = "Sale_Price_2"
)

Sale_Price_2
Warning: There were 2 divergent transitions after warmup. Increasing adapt_delta above
0.999 may help. See
http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
 Family: gaussian 
  Links: mu = identity; sigma = identity 
Formula: log_sale ~ (list_year + log_assessed1 + season + non_use_code + (1 + log_assessed1 | Town)) 
   Data: train_data (Number of observations: 414) 
  Draws: 4 chains, each with iter = 4000; warmup = 2000; thin = 1;
         total post-warmup draws = 8000

Multilevel Hyperparameters:
~Town (Number of levels: 99) 
                             Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept)                    3.33      0.56     2.36     4.54 1.00     2163     3411
sd(log_assessed1)                0.27      0.05     0.19     0.38 1.00     2188     3373
cor(Intercept,log_assessed1)    -1.00      0.00    -1.00    -1.00 1.00     2340     3119

Regression Coefficients:
              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept         3.29      3.49    -3.60     8.68 1.00     6053     5649
list_year         0.02      0.45    -0.70     0.91 1.00     8317     5738
log_assessed1     0.80      0.06     0.68     0.91 1.00     3876     5038
season            0.02      0.02    -0.02     0.06 1.00     5237     7940
non_use_code     -0.09      0.01    -0.10    -0.07 1.00     8116     5796

Further Distributional Parameters:
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma     0.52      0.02     0.48     0.56 1.00     6358     5938

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

All R-hat values show good convergence at 1.00, and Bulk ESS and Tail ESS are all sufficiently high. However, there were 2 divergent transitions after warmup even at adapt_delta = 0.999, and the town-level correlation cor(Intercept, log_assessed1) is pinned at -1.00, which suggests a degenerate group-level structure.

The model intercept estimates the log sale price to be 3.29 (exp(3.29) ≈ $27) when all predictors (list_year, log_assessed1, season, and non_use_code) are zero, but with wide uncertainty. This interpretation, however, is not meaningful in real-world terms. list_year and season show weak effects which might not be useful. There is stronger variability in baseline log sale prices across towns (sd = 3.33), much higher than in model 1, suggesting town differences are more variable now that the slopes also vary with log_assessed1.

Keep in mind that since the intercept represents the expected log sale price when all predictors are at zero, and the predictors are not centered, these values are not meaningful in the context of the data (e.g., list_year = 0), making the intercept not directly interpretable in a real-world sense. We therefore need to center some predictors in a meaningful way; for this reason we centered the list_year variable.

# updated model 3

# loosening the priors for the intercept and slopes
def_priors_loose <- c(
  prior(normal(12, 2), class = Intercept), # mean and sd from data
  prior(normal(0, 3.5), class = b), # looser slope prior
  prior(exponential(1.07), class = sigma)
)

# center the (already log-transformed) list_year at its mean; use the training
# mean for test_data too, so model 3 can later predict on it
list_year_mean <- mean(train_data$list_year, na.rm = TRUE)
train_data <- train_data %>% mutate(list_year_centered = list_year - list_year_mean)
test_data <- test_data %>% mutate(list_year_centered = list_year - list_year_mean)

# model 3
Sale_Price_3 <- brm(
  formula = log_sale ~ list_year_centered + log_assessed1 + season + non_use_code +
    (1 + log_assessed1 | town),
  data = train_data,
  family = gaussian(),
  prior = def_priors_loose,
  iter = 4000,
  chains = 4,
  seed = 123,
  control = list(adapt_delta = 0.999, max_treedepth = 12),
  file = "Sale_Price_3"
)

Sale_Price_3
 Family: gaussian 
  Links: mu = identity; sigma = identity 
Formula: log_sale ~ list_year_centered + log_assessed1 + season + non_use_code + (1 + log_assessed1 | town) 
   Data: train_data (Number of observations: 414) 
  Draws: 4 chains, each with iter = 4000; warmup = 2000; thin = 1;
         total post-warmup draws = 8000

Multilevel Hyperparameters:
~town (Number of levels: 99) 
                             Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept)                    3.33      0.56     2.35     4.54 1.00     3247     5312
sd(log_assessed1)                0.27      0.05     0.19     0.38 1.00     3319     5323
cor(Intercept,log_assessed1)    -1.00      0.00    -1.00    -1.00 1.00     2635     4379

Regression Coefficients:
                   Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept              3.33      0.71     1.92     4.69 1.00     4987     5984
list_year_centered     0.85      3.41    -5.92     7.58 1.00    17337     5511
log_assessed1          0.80      0.06     0.69     0.92 1.00     5021     5786
season                 0.03      0.03    -0.02     0.08 1.00    17376     6292
non_use_code          -0.09      0.01    -0.11    -0.07 1.00    16226     6236

Further Distributional Parameters:
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma     0.52      0.02     0.48     0.56 1.00    11046     5478

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

All R-hat values are at 1.00, indicating good convergence, and Bulk ESS and Tail ESS are all sufficiently high, although the town-level correlation is still estimated at -1.00.

The expected log sale price is 3.33 for an average (centered) list year, with the other predictors at zero and an average town effect. This is much more reasonable than before, but only for the year predictor. There is still large variability in baseline log sale prices across towns (sd = 3.33), along with a meaningful but small variation in the effect of log_assessed1 across towns (sd = 0.27). The residual sigma of 0.52 indicates substantial remaining noise on the log scale.

The intervals for list_year_centered and the intercept are the largest, with the interval for list_year_centered also crossing zero. This indicates that there is still no clear effect of list_year_centered on the dependent variable.

# Traceplots and posterior densities of model 1
plot(Sale_Price_1, ask = FALSE)

Traceplots demonstrate good mixing across all four chains, indicating good convergence. The average deviation of the intercept across towns is about 0.20 log-sale units. Sigma is centered around 0.55–0.60, which corresponds to a multiplicative error of roughly exp(0.58) ≈ 1.8 on the original price scale for individual house price predictions. log_assessed1 and non_use_code are the only clearly estimated, meaningful predictors in this model, with tighter credible intervals. There is moderate variation in baseline prices across towns: sd = 0.20, i.e., exp(0.20) ≈ 1.22, about a 22% difference in sale prices across towns.

plot(Sale_Price_2, ask = FALSE)

The intercept and list_year are shrunk towards zero, as expected under the horseshoe prior, although the traceplots do show some spikes away from the center.

log_assessed1: the histogram shows most values are > 0, with some spread. The traceplot shows active sampling across chains, not collapsed near zero. This predictor is important and not strongly shrunk; it contributes meaningfully to predicting log sale price.

season: the histogram shows values highly concentrated near 0. The traceplot is well mixed, suggesting good convergence.

non_use_code: the histogram also shows values near 0, though with slightly more spread than season. The traceplot shows a similar pattern.

Some further traceplots, for example that of sdb_list_year, are not well mixed.

plot(Sale_Price_3, ask = FALSE)

Model 3 shows mostly stable traceplots, better mixed than those of the two previous models. Only cor_town__Intercept__log_assessed1 shows poor mixing.

  1. Explain each model and describe its structure (what they assume about potential population-level or group-level effects), and the type of priors used.

Model 1: Our aim was to establish a baseline model. We selected independent variables from the dataset that, based on common sense reasoning, are expected to have the strongest predictive value. These include: list_year, assessed_value, season, non_use_code, and town (used as a grouping variable). Given the distribution of assessed_value, we applied a log transformation to both this variable and the dependent variable to bring them into a more manageable range. For the priors, we opted for weakly informative Gelman priors.

Model 2: For the second model we also allow the slopes to vary with the assessed value, because it’s reasonable to assume that there are differences in the assessed value per town. In addition, we added a horseshoe prior because it is plausible that not all variables add information. Due to the warnings we got, we set max_treedepth to 12 and adapt_delta to 0.999.

Model 3: This final version is a multilevel model where the slopes vary with the assessed value per town, as in model 2. However, it improves on model 2 by widening the intercept and slope priors to resolve the prior-data conflicts found in the prior analysis of model 2, while still maintaining some regularization. The priors are set back to Gelman-style priors, to avoid the over-regularization of model 2. It also centers the logged list_year feature at its mean, rather than at 0 as in the previous models, making it more interpretable in real-life terms. The population-level (fixed) effects are the intercept, list_year_centered, log_assessed1, season, and non_use_code, which show how these predictors relate to log sale price. The group-level effects capture a random intercept and a random slope, allowing each town its own baseline log sale price and, additionally, its own sensitivity to the assessed value. The priors are loosened enough to let the data speak more while remaining weakly informative: a normal(12, 2) prior on the intercept centers it at the mean log sale price with a ±2-unit deviation, a normal(0, 3.5) prior on all slopes allows moderately large effects while still shrinking excessive noise, and an exponential(1.07) prior on sigma matches the data’s residual variability. We again used a Gaussian likelihood for comparable model results, despite prices being positive and right-skewed (other distributions were omitted for comparability purposes).
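The town-specific deviations described above can be inspected directly; a sketch, assuming the fitted Sale_Price_3 object from earlier:

```r
# Group-level (per-town) estimates from model 3: ranef() returns an array of
# levels x (Estimate, Est.Error, Q2.5, Q97.5) x coefficients
town_effects <- ranef(Sale_Price_3)$town
head(town_effects[, , "Intercept"])     # per-town intercept deviations
head(town_effects[, , "log_assessed1"]) # per-town slope deviations
```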

4. Model checking (3pt)

  1. Perform a prior sensitivity analysis for each model and modify the model if appropriate. Justify.
# Prior sensitivity analysis of model 1
powerscale_sensitivity(Sale_Price_1)
Sensitivity based on cjs_dist
Prior selection: all priors
Likelihood selection: all data

             variable prior likelihood                                diagnosis
          b_Intercept 0.089      0.033 potential strong prior / weak likelihood
          b_list_year 0.090      0.037 potential strong prior / weak likelihood
      b_log_assessed1 0.002      0.200                                        -
             b_season 0.002      0.108                                        -
       b_non_use_code 0.002      0.104                                        -
   sd_Town__Intercept 0.003      0.504                                        -
                sigma 0.005      0.380                                        -
            Intercept 0.003      0.063                                        -
  r_Town[1,Intercept] 0.003      0.076                                        -
  r_Town[2,Intercept] 0.004      0.154                                        -
  r_Town[3,Intercept] 0.002      0.140                                        -
  r_Town[4,Intercept] 0.002      0.246                                        -
  r_Town[5,Intercept] 0.004      0.181                                        -
  r_Town[6,Intercept] 0.003      0.162                                        -
  r_Town[7,Intercept] 0.003      0.236                                        -
  r_Town[8,Intercept] 0.003      0.124                                        -
 r_Town[10,Intercept] 0.002      0.050                                        -
 r_Town[11,Intercept] 0.003      0.230                                        -
 r_Town[12,Intercept] 0.003      0.442                                        -
 r_Town[13,Intercept] 0.003      0.227                                        -
 r_Town[14,Intercept] 0.005      0.288                                        -
 r_Town[15,Intercept] 0.002      0.112                                        -
 r_Town[16,Intercept] 0.002      0.167                                        -
 r_Town[18,Intercept] 0.004      0.155                                        -
 r_Town[19,Intercept] 0.004      0.174                                        -
 r_Town[20,Intercept] 0.003      0.273                                        -
 r_Town[21,Intercept] 0.003      0.192                                        -
 r_Town[22,Intercept] 0.003      0.148                                        -
 r_Town[23,Intercept] 0.003      0.089                                        -
 r_Town[24,Intercept] 0.002      0.041                                        -
 [ reached 'max' / getOption("max.print") -- omitted 77 rows ]
# Prior sensitivity analysis of model 2
powerscale_sensitivity(Sale_Price_2)
Sensitivity based on cjs_dist
Prior selection: all priors
Likelihood selection: all data

                           variable prior likelihood
                        b_Intercept 0.075      0.027
                        b_list_year 0.079      0.024
                    b_log_assessed1 0.016      0.115
                           b_season 0.087      0.070
                     b_non_use_code 0.014      0.135
                 sd_Town__Intercept 0.050      0.245
             sd_Town__log_assessed1 0.050      0.242
 cor_Town__Intercept__log_assessed1 0.041      0.390
                              sigma 0.011      0.437
                          hs_global 1.452      0.357
                            hs_slab 1.103      0.193
                      sdb_list_year 0.628      0.068
                  sdb_log_assessed1 0.155      0.047
                         sdb_season 0.743      0.095
                   sdb_non_use_code 0.439      0.113
                          Intercept 0.008      0.037
                r_Town[1,Intercept] 0.009      0.033
                r_Town[2,Intercept] 0.017      0.116
                r_Town[3,Intercept] 0.016      0.055
                r_Town[4,Intercept] 0.018      0.060
                r_Town[5,Intercept] 0.014      0.156
                r_Town[6,Intercept] 0.012      0.064
                r_Town[7,Intercept] 0.011      0.126
                r_Town[8,Intercept] 0.009      0.042
               r_Town[10,Intercept] 0.012      0.070
               r_Town[11,Intercept] 0.018      0.153
               r_Town[12,Intercept] 0.010      0.269
               r_Town[13,Intercept] 0.011      0.050
               r_Town[14,Intercept] 0.016      0.070
               r_Town[15,Intercept] 0.012      0.033
                                diagnosis
 potential strong prior / weak likelihood
 potential strong prior / weak likelihood
                                        -
            potential prior-data conflict
                                        -
                                        -
            potential prior-data conflict
                                        -
                                        -
            potential prior-data conflict
            potential prior-data conflict
            potential prior-data conflict
 potential strong prior / weak likelihood
            potential prior-data conflict
            potential prior-data conflict
                                        -
                                        -
                                        -
                                        -
                                        -
                                        -
                                        -
                                        -
                                        -
                                        -
                                        -
                                        -
                                        -
                                        -
                                        -
 [ reached 'max' / getOption("max.print") -- omitted 184 rows ]
# Prior sensitivity analysis of model 3
powerscale_sensitivity(Sale_Price_3)
Sensitivity based on cjs_dist
Prior selection: all priors
Likelihood selection: all data

                           variable prior likelihood                     diagnosis
                        b_Intercept 0.011      0.101                             -
               b_list_year_centered 0.095      0.052 potential prior-data conflict
                    b_log_assessed1 0.011      0.098                             -
                           b_season 0.003      0.103                             -
                     b_non_use_code 0.004      0.134                             -
                 sd_town__Intercept 0.046      0.234                             -
             sd_town__log_assessed1 0.046      0.229                             -
 cor_town__Intercept__log_assessed1 0.036      0.426                             -
                              sigma 0.002      0.416                             -
                          Intercept 0.002      0.052                             -
                r_town[1,Intercept] 0.005      0.052                             -
                r_town[2,Intercept] 0.009      0.090                             -
                r_town[3,Intercept] 0.007      0.052                             -
                r_town[4,Intercept] 0.007      0.060                             -
                r_town[5,Intercept] 0.011      0.127                             -
                r_town[6,Intercept] 0.007      0.030                             -
                r_town[7,Intercept] 0.009      0.102                             -
                r_town[8,Intercept] 0.008      0.042                             -
               r_town[10,Intercept] 0.008      0.065                             -
               r_town[11,Intercept] 0.012      0.098                             -
               r_town[12,Intercept] 0.014      0.285                             -
               r_town[13,Intercept] 0.007      0.045                             -
               r_town[14,Intercept] 0.009      0.090                             -
               r_town[15,Intercept] 0.009      0.078                             -
               r_town[16,Intercept] 0.007      0.033                             -
               r_town[18,Intercept] 0.007      0.038                             -
               r_town[19,Intercept] 0.008      0.097                             -
               r_town[20,Intercept] 0.008      0.166                             -
               r_town[21,Intercept] 0.009      0.066                             -
               r_town[22,Intercept] 0.007      0.038                             -
 [ reached 'max' / getOption("max.print") -- omitted 178 rows ]

The power-scaling sensitivity analysis of model 1 shows weak likelihood sensitivity combined with overly influential priors for the intercept and the b_list_year coefficient, suggesting those priors are too tight.

In model 2 we again see overly strong priors for the intercept and b_list_year, with additional prior-data conflicts for season, sd_town__Intercept, and others. This is possibly due to over-regularization from the horseshoe prior on the slopes, which shrinks several effects too aggressively.

In model 3 we reverted to the Gelman-style priors, loosened for the intercept and slopes. This removed most of the prior-data conflicts; the only one remaining is for the b_list_year_centered variable.
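For reference, loosened weakly informative priors of the kind described here can be declared in brms roughly as follows. The classes are the ones brms uses for this kind of multilevel model, but the scales below are illustrative, not necessarily the exact values used for Sale_Price_3:

```r
library(brms)

# Illustrative loosened weakly informative priors (hypothetical scales,
# not necessarily those used for model 3)
loose_priors <- c(
  set_prior("normal(0, 5)",         class = "Intercept"),  # wide prior on the intercept
  set_prior("normal(0, 2.5)",       class = "b"),          # wide prior on the slopes
  set_prior("student_t(3, 0, 2.5)", class = "sd")          # heavy-tailed prior on group SDs
)
loose_priors
```

Passing a set like this via the `prior` argument of `brm()` replaces the defaults only for the listed classes; everything else keeps its brms default.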

  1. Conduct posterior predictive checks for each model to assess how well they fit the data. Explain what you conclude.
# Posterior predictive checks for model 1: predictive intervals
pp_check(Sale_Price_1, type = "intervals", x = "log_sale", prob_outer = 0.95)

# Posterior predictive checks for model 1: density overlay
pp_check(Sale_Price_1, ndraws = 200)

# Posterior predictive checks for model 1: joint mean/SD statistic
pp_check(Sale_Price_1, type = "stat_2d")

# Posterior predictive checks for model 2: predictive intervals
pp_check(Sale_Price_2, type = "intervals", x = "log_sale", prob_outer = 0.95)

# Posterior predictive checks for model 2: density overlay
pp_check(Sale_Price_2, ndraws = 200)

# Posterior predictive checks for model 2: joint mean/SD statistic
pp_check(Sale_Price_2, type = "stat_2d")

# Posterior predictive checks for model 3: predictive intervals
pp_check(Sale_Price_3, type = "intervals", x = "log_sale", prob_outer = 0.95)

# Posterior predictive checks for model 3: density overlay
pp_check(Sale_Price_3, ndraws = 200)

# Posterior predictive checks for model 3: joint mean/SD statistic
pp_check(Sale_Price_3, type = "stat_2d")

Model 1: The posterior predictive checks show that the model does not predict well in the low log_sale region, possibly due to the scarcity of data there. The density increases with higher log_sale values, but the predictive intervals are wide, indicating high uncertainty. The model is most confident, with tighter predictive intervals, in the middle of the data range. The density overlay shows that the model captures the overall shape of the data quite well, although it sits slightly low: the observed distribution (dark blue line) mostly lies within the light blue predictive draws. The last plot shows that the model's simulated distributions of log_sale closely match the observed data in both mean and standard deviation; the observed dark blue point lies within the cloud of simulated light blue points, which indicates a good fit.

Model 2: The posterior predictive checks for model 2 paint the same picture as for model 1, but with less uncertainty over most of the intervals plot, indicating a good fit.

Model 3: The posterior predictive checks for model 3 show the same pattern as for model 2.

5. Model Comparison (1.5pt)

  1. Use k-fold cross-validation to compare the models.
set.seed(123)
# assign each observation at random to one of 10 folds
k <- loo::kfold_split_random(K = 10, N = nrow(train_data))
# note: kfold() returns a kfold object, so these calls overwrite the
# fitted brmsfit objects with their cross-validation results
Sale_Price_1 <- kfold(Sale_Price_1, chains = 1, folds = k, save_fits = FALSE)
Sale_Price_2 <- kfold(Sale_Price_2, chains = 1, folds = k, save_fits = FALSE)
Sale_Price_3 <- kfold(Sale_Price_3, chains = 1, folds = k, save_fits = FALSE)

print(Sale_Price_1)

Based on 10-fold cross-validation.

           Estimate   SE
elpd_kfold   -378.7 25.8
p_kfold        33.2  5.8
kfoldic       757.5 51.6
print(Sale_Price_2)

Based on 10-fold cross-validation.

           Estimate   SE
elpd_kfold   -353.5 28.0
p_kfold        59.8 11.1
kfoldic       707.0 55.9
print(Sale_Price_3)

Based on 10-fold cross-validation.

           Estimate   SE
elpd_kfold   -353.9 28.1
p_kfold        60.5 11.5
kfoldic       707.9 56.2
set.seed(123)
# Question 5
loo_compare(Sale_Price_1, Sale_Price_2, Sale_Price_3)
             elpd_diff se_diff
Sale_Price_2   0.0       0.0  
Sale_Price_3  -0.4       1.2  
Sale_Price_1 -25.3      14.7  
# Question 6
# reload the fitted model from disk (the Sale_Price_2 object was
# overwritten by its kfold results above)
fit <- readRDS("Sale_Price_2.rds")

# five fixed effects with the largest absolute posterior means (intercept dropped)
fx <- fixef(fit)[-1, ]
print(head(fx[order(abs(fx[, "Estimate"]), decreasing = TRUE), ], 5), digits = 3)
              Estimate Est.Error    Q2.5   Q97.5
log_assessed1   0.7952   0.05830  0.6847  0.9120
non_use_code   -0.0869   0.00863 -0.1039 -0.0701
list_year       0.0224   0.45139 -0.6997  0.9090
season          0.0151   0.02102 -0.0181  0.0633
# Question 7
set.seed(123)
# back-transform predictions and observations to the original dollar scale
pred <- exp(colMeans(posterior_predict(fit, newdata = test_data, allow_new_levels = TRUE)))
obs <- exp(test_data$log_sale)

rmse <- sqrt(mean((pred - obs)^2))
mae <- mean(abs(pred - obs))
mape <- mean(abs((pred - obs) / obs)) * 100

cat("RMSE =", rmse, "\n")
RMSE = 123544.7 
cat("MAE  =", mae, "\n")
MAE  = 83892.18 
cat("MAPE =", mape, "%\n")
MAPE = 62.32736 %
  1. Determine the best model based on predictive accuracy and justify your decision.

Comparing elpd_diff and se_diff across the three models, Sale_Price_2 comes out best and serves as the baseline (both elpd_diff and se_diff are zero by construction). Sale_Price_1 has an elpd_diff of −25.3 with an se_diff of 14.7: the difference exceeds the rule-of-thumb threshold of 4, but it is smaller than 2 × se_diff (29.4), so by the 2 × SE rule Sale_Price_1 and Sale_Price_2 do not differ meaningfully.

Sale_Price_3 sits between the other two, with an elpd_diff of −0.4 and an se_diff of 1.2; this difference falls below both the 2 × se_diff threshold and the threshold of 4 for elpd_diff. Overall, Sale_Price_2 is chosen, but only with weak evidence: it is not meaningfully better in predictive performance than the other models.
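The 2 × SE arithmetic behind this conclusion can be checked directly, using the numbers from the loo_compare table above:

```r
# elpd differences relative to the baseline model Sale_Price_2
elpd_diff <- c(Sale_Price_3 = -0.4, Sale_Price_1 = -25.3)
se_diff   <- c(Sale_Price_3 =  1.2, Sale_Price_1 =  14.7)

# rule of thumb: a difference is meaningful only if |elpd_diff| > 4
# and |elpd_diff| > 2 * se_diff
meaningful <- abs(elpd_diff) > 4 & abs(elpd_diff) > 2 * se_diff
meaningful
#> Sale_Price_3 Sale_Price_1
#>        FALSE        FALSE
```

For Sale_Price_1 the first condition holds (25.3 > 4) but the second does not (25.3 < 2 × 14.7 = 29.4), so neither model is flagged as meaningfully different from the baseline.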

6. Interpretation of Important Parameters (1.5pt)

Sale prices in this model rise almost proportionally with assessed value: a one-percent increase in the assessed figure translates into roughly a 0.8 percent increase in the expected sale price, so doubling the assessment multiplies the expected sale price by about 2^0.8 ≈ 1.74, somewhat less than a full doubling. By contrast, parcels classified as “Vacant” carry a clear penalty, selling for about eight percent less than comparable properties with the baseline use-code. The calendar year of listing hints at a two-percent annual upward drift in prices, but the wide uncertainty band around that estimate means the data do not rule out the absence of any real trend. Seasonality shows an even smaller, statistically unclear effect, implying that when within the year a property is listed probably matters little once other factors are taken into account.
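These interpretations follow from back-transforming the log-scale coefficients; using the point estimates from the table above:

```r
b_assessed <- 0.7952   # elasticity of sale price w.r.t. assessed value
b_vacant   <- -0.0869  # coefficient for the "Vacant" non-use code

# doubling the assessed value multiplies the expected sale price by
2^b_assessed      # ~1.74, i.e. about a 74% increase, not a full doubling
# multiplicative effect of the "Vacant" classification
exp(b_vacant)     # ~0.92, i.e. roughly an 8% discount
```

The exponential back-transform is what turns additive log-scale effects into multiplicative effects on the original price scale.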

7. Report a loss function on the test set (Optional for bonus 0.5 to 1pt, depending on if you use RMSE or another function).

Report RMSE or other loss (or utility) function on the test set. (Transform it back if necessary).

We chose RMSE and MAE as loss functions. We first transformed the predictions from the log scale back to the original (dollar) scale, so that the RMSE and MAE are interpretable. The RMSE and MAE were 123544.7 and 83892.18 respectively. To aid interpretation, we also included MAPE as a measure of the relative magnitude of the error, which was 62.3%.
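A toy illustration (made-up numbers, not from the dataset) of why the back-transformation matters: errors on the log scale are multiplicative and unitless, while errors on the original scale are in dollars:

```r
log_pred <- c(12.0, 12.5)   # hypothetical log-scale predictions
log_obs  <- c(12.1, 12.3)   # hypothetical log-scale observations

rmse_log    <- sqrt(mean((log_pred - log_obs)^2))            # unitless
rmse_dollar <- sqrt(mean((exp(log_pred) - exp(log_obs))^2))  # in dollars
c(rmse_log = rmse_log, rmse_dollar = rmse_dollar)
```

Only the dollar-scale figure answers the practical question "how far off is a typical predicted sale price".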

Contributions of each member

  • Jonathan Otterspeer: model 2, dataset pre-processing, output interpretation, and template code, 5,6,7

  • Niya: model 3, model checking, output interpretation, PDF file, references, questions 1a, 3b, 4a, 4b

  • Johan Vriend: model 1, dataset selection, output interpretation, setting-up code template, 1b, 2, 3a,b

Statement of technology

During the making of this assignment, ChatGPT was used to deepen the knowledge provided during this course. Example of a prompt used: “You are a statistician specialized in Bayesian statistics. Explain the exact workings of the horseshoe prior in Bayesian statistics in normal human language.” No AI tools were used in assessing or evaluating the models.

References

Masidekabakci. (2024, December 30). Real Estate Sales 2001–2022 [Data set]. Kaggle. https://www.kaggle.com/code/masidekabakci/real-estate-sales-2001-2022